Overview

Today, we will return to exploring Russians’ support for the war in Ukraine using a public opinion survey conducted in Russia by Alexei Miniailo’s “Do Russians Want War” project.

The survey was conducted by phone using a random sample of mobile phone numbers to produce a sample of respondents representative of the population in terms of age, sex, and geography. It was in the field from February 28 to March 2.

First, we will explore how support for the war varies with the demographic predictors age, sex, and education. We will compare the results of modeling this relationship using Ordinary Least Squares (OLS) regression and logistic regression.

We’ll talk more about the technical aspects of logistic regression next week. For today we’ll simply compare the results from these two approaches.

Next, we will gain insight into how our estimates from these models might vary using the statistical process of bootstrapping. Specifically, we will simulate the idea of repeated sampling that is the foundation of frequentist interpretations of probability, by sampling from our sample with replacement.

We’ll walk through the mechanics of simulation together. Then you’ll quantify the uncertainty described by these bootstrapped sampling distributions.

Finally, we’ll see what other questions we might ask of these data and practice various skills we’ve developed throughout the course.

Plan to spend the following amount of time on each section:

  1. Get set up to work (5 minutes)

  2. Model the relationship between demographic predictors and war support using OLS and Logistic regression (20 minutes)

  3. Assess the uncertainty around your estimated coefficients (15 minutes)

  4. Quantify the uncertainty described by your sampling distributions (10 minutes)

  5. Explore other relationships in the data. (30 minutes)

You will work in your assigned groups. Only one member of each group needs to submit the html file produced by knitting the lab.

This lab must contain the names of the group members in attendance.

If you are attending remotely, you will submit your labs individually.

You can find your assigned groups in previous labs.

Goals

Today’s lab covers a little bit of everything. You will

  • learn how to model binary outcomes using logistic regression

  • develop an intuition for how to think about uncertainty using simulations to describe what might have happened

  • practice formulating, estimating, presenting and interpreting regression models.

Workflow

Please knit this .Rmd file

As with every lab, you should:

  • Download the file
  • Save it in your course folder
  • Update the author: section of the YAML header to include the names of your group members in attendance.
  • Knit the document
  • Open the html file in your browser (Easier to read)
  • Write your code in the chunks provided
  • Comment out or delete any test code you do not need
  • Knit the document again after completing a section or chunk (Error checking)
  • Upload the final lab to Canvas.

1 Get set up to work

1.1 Load Packages

First, let’s load the libraries we’ll need for today.

the_packages <- c(
  ## R Markdown
  "kableExtra","DT","texreg","htmltools",
  "flair", # Comments only
  ## Tidyverse
  "tidyverse", "lubridate", "forcats", "haven", "labelled",
  "modelr", "purrr",
  ## Extensions for ggplot
  "ggmap","ggrepel", "ggridges", "ggthemes", "ggpubr", 
  "GGally", "scales", "dagitty", "ggdag", "ggforce",
  # Data 
  "COVID19","maps","mapdata","qss","tidycensus", "dataverse",
  # Analysis
  "DeclareDesign", "boot"
)

# Define function to load packages
ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}

ipak(the_packages)

There are three packages in particular that we’ll need to make sure are installed and loaded:

  • modelr
  • purrr
  • broom

If ipak didn’t return TRUE for each of these, please uncomment and run:

# install.packages("modelr")
# install.packages("purrr")
# install.packages("broom")

1.2 Load the data

Next, we’ll load the recoded data for the lab.

Our primary outcome of interest (dependent variable) for today is a binary measure of support for war:

  • support_war01 “Please tell me, do you support or do not support Russia’s military actions on the territory of Ukraine?” (1=yes, 0 = no)

Our key predictors (independent variables) are the following demographic variables:

  • age “How old are you?”

  • sex “Gender of respondent” (As assessed by the interviewer)

  • education_n “What is your highest level of education (confirmed by a diploma, certificate)?” (1 = “Primary School”, 2 = “High School”, 3 = “Vocational School”, 4 = “College”, 5 = “Graduate Degree”)1

load(url("https://pols1600.paultesta.org/files/data/df_drww.rda"))

2 Model the relationship between demographic predictors and war support using OLS and Logistic regression.

2.1 Fit the models

Please estimate the following models:

  • An OLS regression model called m1 using lm()
  • A Logistic regression model called m2 using glm() with family=binomial
# OLS
# m1 <- ???

# Logisitic 
# m2 <- ???
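If the syntax is unfamiliar, here is a generic sketch of the lm()/glm() interface using the built-in mtcars data (not the lab data); the only change needed for logistic regression is the family = binomial argument:

```r
# Generic syntax illustration with built-in data (mtcars), not the lab data:
# the binary outcome am (automatic vs. manual) modeled two ways
ols_demo   <- lm(am ~ mpg + wt, data = mtcars)
logit_demo <- glm(am ~ mpg + wt, data = mtcars, family = binomial)

coef(ols_demo)   # coefficients on the probability scale
coef(logit_demo) # coefficients on the log-odds scale
```

Your models for the lab should use support_war01 as the outcome and the demographic predictors described above.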

2.2 Display the results in a regression table

Next, please display the results of your regressions in a table using htmlreg()

# Regression Table

2.3 Produce predicted values from the model

The coefficients from a logistic regression aren’t easy to directly interpret.

Instead, we will produce predicted values for each model

To do this, we will need to create a prediction data frame called pred_df. Every variable in your model needs to be represented in your prediction data frame.

  • Use expand_grid() to create a data frame where

    • age varies from 18 to 99
    • sex is held constant at “Female”
    • education_n is held constant at its mean
## Create prediction data frame
# pred_df <- expand_grid(
#   age = ??? : ???,
#   ??? = ???,
#   ???
# )
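If you haven’t used expand_grid() before, here is a toy illustration (with made-up variables, not the lab’s): it returns one row for every combination of the values you supply, so variables held at a single value are simply repeated down the rows:

```r
library(tidyr)

# Toy example: x takes three values; the other columns are held constant
demo_grid <- expand_grid(
  x = 1:3,
  group = "Female",
  z = 0.5
)
demo_grid # 3 rows; group and z repeat alongside each value of x
```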

Then you’ll use the predict() function to produce predicted values from each model.

Save the output of predict() for m1 to a new column in pred_df called pred_ols.

For m2, you need to tell R to transform the predictions from m2 back into the units of the response (outcome) variable by setting the argument type = "response". Save the output of predict() for m2 to a new column in pred_df called pred_logit.

# #Predicted values for m1
# pred_df$pred_ols <- predict(???,
#                             newdata = ???)
# #Predicted values for m2
# #Remember to add type = "response"
# pred_df$pred_logit <- ???
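To see the difference that type = "response" makes, here is a generic illustration (again using the built-in mtcars data rather than the lab data):

```r
# Logistic regression on built-in data (not the lab data)
fit  <- glm(am ~ mpg, data = mtcars, family = binomial)
newd <- data.frame(mpg = c(15, 25))

# Default scale: log-odds (can be any real number)
predict(fit, newdata = newd)

# type = "response": predicted probabilities, bounded between 0 and 1
p <- predict(fit, newdata = newd, type = "response")
p
```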

2.4 Plot the predicted values and interpret the results

Now we can compare the predictions of OLS and Logistic regression by plotting the predicted values of support for the war from each model.

To produce this plot you’ll need to

  • specify data (you want to use the values from pred_df)
  • map values from your data to aesthetics in ggplot
    • put age on the x-axis and pred_ols on the y-axis
  • specify the geometries to plot
    • add two geom_line() to the plot
    • leave the first one empty (e.g. geom_line())
    • for the second, specify a new aes of y=pred_logit
# data %>%

# aesthetics with ggplot()

# geometries geom_line()

How do the predictions of the two models compare?

The predictions from OLS produce impossible values (levels of support above 100 percent) for very old respondents, while the predictions from logistic regression are constrained to lie between 0 and 1.

If we think that logistic regression provides a more credible way of modeling support for the war, then our OLS regression appears to overstate the level of support among the young and the old, while possibly understating the level of support among the middle-aged. The differences aren’t huge – a few percentage points – but for a binary outcome we will often prefer to model it with logistic regression.

Also note that the marginal effect of age in the logistic regression is not constant. An increase in age from 25 to 26 is associated with a larger increase in support than an increase in age from 75 to 76.


3 Assess the uncertainty around your estimated coefficients

How confident are we that the true relationship between age and support for the war is positive? How different might the coefficient be if we had taken a different random sample of Russians in early March 2022?

In this section, we will explore how we can simulate the idea of “repeated” sampling, using just the sample we have, to quantify uncertainty about the estimates in our model.

To do this, we will:

  1. Take 1,000 bootstrap samples from df_drww with replacement, using the bootstrap function from the modelr package.
  • By sampling with replacement, we allow the same observation to be included multiple times (or not at all) in our bootstrapped sample.
  2. For each bootstrap sample, estimate the same model as above, using the map function from the purrr package.
  • We will end up with 1,000 bootstrapped estimates for each coefficient in our model.
  • These bootstrapped estimates describe the sampling distribution of each coefficient.
  3. Put the coefficients from these bootstrapped models into a tidy data frame, using the tidy function from the broom package and the map_df function from the purrr package.

  4. Visualize the sampling distribution for each coefficient in our model.

In the code below, I demonstrate this process for m1. Then you will repeat this process for m2.

3.1 Take 1,000 bootstrap samples from df_drww

Below we create 1,000 bootstrapped samples:

# Make sure these packages are loaded
library(modelr)
library(purrr)
library(broom)
# Set random seed for reproducibility

set.seed(1234)

# 1,000 bootstrap samples
boot <- modelr::bootstrap(df_drww, 1000)

Let’s take a moment to understand what boot is and why we’re sampling with replacement.

The object boot contains 1,000 bootstrapped samples from df_drww.

If we look at the first bootstrap we see:

boot$strap[[1]]
<resample [1,807 x 42]> 1308, 1018, 1125, 1004, 623, 905, 645, 934, 400, 900, ...

The numbers 1308, 1018, 1125, 1004, 623, 905, ... correspond to rows in df_drww. So person 1308 is the first observation in this bootstrap sample, then person 1018, and so on.

Because we are sampling with replacement, observations from df_drww can appear multiple times. In our first bootstrap sample:

  • 666 observations appeared once
  • 342 appeared twice
  • 104 appeared three times
  • 27 appeared four times
  • 5 appeared five times.
  • 2 appeared six times
table(table(boot$strap[[1]]$idx))

  1   2   3   4   5   6 
666 342 104  27   5   2 

Why would we want to sample with replacement?

Well, what we’d really love are 1,000 separate random surveys all drawn from the same population in the same way.

Since that’s not feasible, we instead use the one sample we do have to learn things like how much our estimate might vary in repeated sampling. Efron (1979) called this procedure “bootstrapping” after the idiom “to pull oneself up by one’s own bootstraps”

We do this by sampling from our sample with replacement.

When we sample with replacement, we are sampling from our sample, as our sample was sampled from the population.

With replacement means that some observations will appear multiple times in our bootstrapped sample (while others will not be included at all).

When an observation appears multiple times in a bootstrap sample, conceptually, we’re using that original observation to represent the other people like that observation in the population who – had we taken a different sample – might have been included in our data.
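You can see this behavior in miniature with base R’s sample() function (a toy example, unrelated to the lab data):

```r
set.seed(42) # for reproducibility
ids <- 1:10

# Draw a "bootstrap sample" of the ids, with replacement
boot_ids <- sample(ids, size = length(ids), replace = TRUE)

table(boot_ids) # some ids appear more than once; others not at all
```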

Note each bootstrap sample is a different random sample with replacement. In our second bootstrap sample, one observation (person 1496) appeared five times.

table(table(boot$strap[[2]]$idx))

  1   2   3   4   5   6 
661 342 105  31   1   3 
# Person 1496
sum(boot$strap[[2]]$idx == 1496)
[1] 5
# Person 1 is not in bootstrap 2
sum(boot$strap[[2]]$idx == 1)
[1] 0

3.2 Estimate 1,000 models from the 1,000 bootstrapped samples

Now let’s estimate our model for each bootstrapped sample, using the map function.

  • In the code below, for every sample in boot, map estimates the model lm(support_war01 ~ age + sex + education_n), plugging the bootstrap sample into the data = . argument.
# bootstrap simulations
bs_ols <- purrr::map(boot$strap, ~ lm(support_war01 ~ age + sex + education_n, data =.))

The end result is a large list containing 1,000 separate linear regression models, one estimated on each bootstrapped sample.

The coefficients from each bootstrap vary from one simulation

# First bootstrap
bs_ols[[1]]

Call:
lm(formula = support_war01 ~ age + sex + education_n, data = .)

Coefficients:
(Intercept)          age      sexMale  education_n  
   0.305019     0.009746     0.081039    -0.024142  

to the next:

# Second bootstrap
bs_ols[[2]]

Call:
lm(formula = support_war01 ~ age + sex + education_n, data = .)

Coefficients:
(Intercept)          age      sexMale  education_n  
   0.245000     0.009142     0.095631    -0.009285  

Because they’re estimated on different bootstrapped samples.

We will visualize and quantify that variation to describe the uncertainty associated with our estimates.

But first, we need to transform our large list of linear models into a tidier data frame that’s easier to manipulate.

3.3 Tidy the results of our bootstrapping

In the code below we transform this large list of models into a tidy data frame of coefficients.

# Tidy bootstrap sims
bs_ols_df <- map_df(bs_ols, tidy, .id = "id")

In the resulting data frame, the id variable identifies the bootstrap simulation (1 to 1,000), and the term variable identifies the specific coefficient from the model estimated in that simulation.

head(bs_ols_df)
# A tibble: 6 × 6
  id    term        estimate std.error statistic  p.value
  <chr> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 1     (Intercept)  0.305    0.0522        5.84 6.29e- 9
2 1     age          0.00975  0.000702     13.9  3.31e-41
3 1     sexMale      0.0810   0.0222        3.65 2.73e- 4
4 1     education_n -0.0241   0.0107       -2.26 2.39e- 2
5 2     (Intercept)  0.245    0.0533        4.60 4.61e- 6
6 2     age          0.00914  0.000731     12.5  3.36e-34

3.4 Plot the bootstrapped sampling distribution of the coefficient for age

Finally, let’s get a sense of how our coefficients could vary.

Specifically, let’s compare the observed coefficient for age from m1 to the bootstrapped sampling distribution of coefficients in bs_ols_df.

  • First, we’ll create a basic plot called p_ols_age that shows the distribution of the coefficients for age from our simulation
p_ols_age <- bs_ols_df %>%
  filter(term == "age") %>%
  ggplot(aes(estimate))+
    geom_density()
p_ols_age

Next we’ll add some additional geometries and labels to our figure

  • First we’ll put a rug to show the individual coefficients
p_ols_age +
  geom_rug() -> p_ols_age
p_ols_age
  • Then we’ll add a vertical line at the value of our observed coefficient on age
p_ols_age +
  geom_vline(xintercept =  coef(m1)[2],
             linetype = 2) -> p_ols_age
p_ols_age
  • Finally, let’s add some nice labels
p_ols_age +
  theme_bw()+
  labs(
    x = "Age",
    y = "",
    title = "Bootstrapped Sampling Distribution of Age Coefficient"
  ) -> p_ols_age
p_ols_age

Congratulations you’ve just produced and visualized your first bootstrapped sampling distribution!

Conceptually, this distribution describes **how much we would expect the coefficient on age in our model to vary** from sample to sample.

Just from eyeballing the figure above, it looks like the observed relationship between age and support for the war of 0.009 could be about as high as 0.011 and as low as 0.007.

Of course, as the budding quantitative social scientists that we are, we can do better than just eyeballing the data.

4 Quantify the uncertainty described by your sampling distributions

Specifically, please calculate the standard deviation and the 2.5th and 97.5th percentiles of each coefficient in bs_ols_df.

You’ll want to

  • group_by() the term variable in bs_ols_df
  • summarize() the output of functions applied to the estimate column in bs_ols_df
# calculate the standard deviations, 97.5 and 2.5 percentiles
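If the group_by()/summarize() pattern is rusty, here is a sketch of the kind of summary you’re after, using simulated data rather than the lab’s bootstrap estimates (the terms and values below are made up for illustration):

```r
library(dplyr)

set.seed(123)
# Fake "bootstrap estimates" for two coefficients
toy <- tibble(
  term     = rep(c("age", "sexMale"), each = 100),
  estimate = c(rnorm(100, 0.01, 0.001), rnorm(100, 0.08, 0.02))
)

# One row per coefficient: spread and percentile bounds of its estimates
boot_summary <- toy %>%
  group_by(term) %>%
  summarize(
    sd    = sd(estimate),
    p2.5  = quantile(estimate, 0.025),
    p97.5 = quantile(estimate, 0.975)
  )
boot_summary
```

Applying the same pattern to bs_ols_df gives you the bootstrapped standard errors and percentile intervals for each coefficient.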

Please compare these estimates to those produced by the following functions:

# summary(m1)
# confint(m1)

5 Explore other relationships in the data.

Finally, let’s take some time to explore other variables and relationships in the data.

Please do the following:

  • Research Question: Formulate a brief research question here

  • Data: Identify the relevant outcome and predictor variables in the data.

    • Outcome:
    • Key Predictor:
    • Covariates:
      • Covariate 1
# Explore the data
  • If necessary, transform and recode the data as needed
# Data transformations if necessary
  • Models and Expectations Specify a model or models to test your question in terms of our outcome and predictor variables.

\[\text{Outcome} = \beta_0 + \beta_1 \text{Key Predictor} + \beta_2 \text{Covariate}\]

  • Expectations Discuss the expected sign and significance of the coefficients in your model. If you think a coefficient could be either positive or negative, explain what each of these results means in the context of your question.

  • Estimation Estimate a model or models to test your question.

# Models
  • Results Present and interpret your results using tables and figures.
# Regression table
# Figures

  1. I think Google Translate was a bit unclear here, but higher numbers equal more education.↩︎